Topic Oriented Semi-supervised Document Clustering
Abstract
In developing a text mining prototype system, we needed to group documents according to the user's needs. However, traditional document clustering is usually treated as unsupervised learning, so it cannot effectively group documents according to those needs. To solve this problem, we propose a new document clustering approach. The main contributions are: (1) describing the user's need with a multiple-attribute topic; (2) a topic-semantic annotation algorithm; and (3) an optimizing hierarchical clustering algorithm that uses a criterion function to find the best clustering solution on the clustering tree. Experiments validate the feasibility and effectiveness of the new approach.
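The third contribution, choosing the best level of a clustering tree with a criterion function, can be illustrated with a minimal sketch. The abstract does not specify the criterion function, the linkage, or the document representation, so everything here is an assumption made for illustration: documents are toy term-count vectors, the tree is built with average-link agglomerative clustering on cosine distance, and the mean silhouette score stands in for the criterion. The topic description and topic-semantic annotation steps are not modeled, and the function names (`build_tree`, `silhouette`, `best_partition`) are hypothetical.

```python
import math
from itertools import combinations


def cosine_distance(a, b):
    """Cosine distance between two non-zero term vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def mean_distance(vectors, i, members):
    """Average distance from document i to the documents in `members`."""
    return sum(cosine_distance(vectors[i], vectors[j]) for j in members) / len(members)


def build_tree(vectors):
    """Average-link agglomerative clustering.

    Returns every level of the tree as a partition (list of index lists),
    from all singletons down to one cluster.
    """
    clusters = [[i] for i in range(len(vectors))]
    levels = [[c[:] for c in clusters]]
    while len(clusters) > 1:
        def link(pair):
            a, b = pair
            return sum(mean_distance(vectors, i, clusters[b]) for i in clusters[a]) / len(clusters[a])

        a, b = min(combinations(range(len(clusters)), 2), key=link)
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
        levels.append([c[:] for c in clusters])
    return levels


def silhouette(vectors, partition):
    """Mean silhouette score; assumed stand-in for the paper's criterion function."""
    if len(partition) < 2:
        return 0.0  # undefined for a single cluster; scored as 0 here
    total = 0.0
    for cluster in partition:
        for i in cluster:
            if len(cluster) == 1:
                continue  # singletons contribute 0 by convention
            within = mean_distance(vectors, i, [j for j in cluster if j != i])
            nearest = min(mean_distance(vectors, i, other)
                          for other in partition if other is not cluster)
            denom = max(within, nearest)
            total += (nearest - within) / denom if denom > 0 else 0.0
    return total / len(vectors)


def best_partition(vectors):
    """Pick the tree level that maximizes the criterion."""
    return max(build_tree(vectors), key=lambda p: silhouette(vectors, p))


# Toy corpus: documents 0-2 use only terms 0-1, documents 3-5 only terms 2-3.
docs = [
    [2, 1, 0, 0], [1, 2, 0, 0], [3, 2, 0, 0],
    [0, 0, 2, 1], [0, 0, 1, 2], [0, 0, 3, 2],
]
best = sorted(sorted(c) for c in best_partition(docs))
print(best)
```

On this toy corpus the two term-disjoint groups are recovered as the best level. In a topic-oriented setting the criterion would presumably also reflect the user's topic, which this sketch does not attempt.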
Similar resources
Semi-Supervised Co-Clustering for Query-Oriented Theme-based Summarization
Sentence clustering plays an important role in theme-based summarization which aims to discover the topical themes defined as the clusters of highly related sentences. However, due to the short length of sentences, the word-vector cosine similarity traditionally used for document clustering is no longer suitable. To alleviate this problem, we regard a word as an independent text object rather t...
From Topic Models to Semi-supervised Learning: Biasing Mixed-Membership Models to Exploit Topic-Indicative Features in Entity Clustering
We present methods to introduce different forms of supervision into mixed-membership latent variable models. First, we introduce a technique to bias the models to exploit topic-indicative features, i.e. features which are known a priori to be good indicators of the latent topics that generated them. Next, we present methods to modify the Gibbs sampler used for approximate inference in such mod...
A Semi-supervised Topic-Driven Approach for Clustering Textual Answers to Survey Questions
We propose an algorithm to effectively cluster a specific type of text document: textual responses gathered through a survey system. Due to the peculiar features exhibited in such responses (e.g., short in length, rich in outliers, and diverse in categories), traditional unsupervised and semi-supervised clustering techniques are challenged to achieve satisfactory performance as demanded by a su...
User-Interest-Based Document Filtering via Semi-supervised Clustering
This paper studies the task of user-interest-based document filtering, where users aim to find documents on a specific topic within a large document collection. This is usually done by a text categorization process, which divides all the documents into two categories: one containing all the desired documents (called positive documents) and the other containing all the other documents (c...
Composite Kernel Optimization in Semi-Supervised Metric
Machine-learning solutions to classification, clustering and matching problems critically depend on the adopted metric, which in the past was selected heuristically. In the last decade, it has been demonstrated that an appropriate metric can be learnt from data, resulting in superior performance as compared with traditional metrics. This has recently stimulated a considerable interest in the to...